{
  "metadata" : {
    "config" : {
      "dependencies" : {
        "python" : [
          "pandas"
        ]
      },
      "exclusions" : [
      ],
      "repositories" : [
      ],
      "sparkConfig" : {
        "spark.driver.memory" : "2g"
      }
    }
  },
  "nbformat" : 4,
  "nbformat_minor" : 0,
  "cells" : [
    {
      "cell_type" : "markdown",
      "execution_count" : 0,
      "metadata" : {
        "language" : "text"
      },
      "language" : "text",
      "source" : [
        "# Scala Spark to Pandas\n",
        "\n",
        "\n",
        "### Converting a Scala Spark DataFrame to a Pandas DataFrame\n",
        "\n",
        "\n",
        "First, notice that we added `pandas` as a `pip` dependency in the Configuration section above. \n",
        "\n",
        "Now, let's create the Scala DataFrame\n",
        "\n",
        "\n"
      ],
      "outputs" : [
      ]
    },
    {
      "cell_type" : "code",
      "execution_count" : 1,
      "metadata" : {
        "cell.metadata.exec_info" : {
          "startTs" : 1572051603097,
          "endTs" : 1572051607579
        },
        "language" : "scala"
      },
      "language" : "scala",
      "source" : [
        "val scalaDF = spark.createDataFrame(List(1,2,3).map(x => (x, x*2)))\n",
        "scalaDF.show()"
      ],
      "outputs" : [
        {
          "name" : "stdout",
          "text" : [
            "+---+---+\n",
            "| _1| _2|\n",
            "+---+---+\n",
            "|  1|  2|\n",
            "|  2|  4|\n",
            "|  3|  6|\n",
            "+---+---+\n",
            "\n"
          ],
          "output_type" : "stream"
        }
      ]
    },
    {
      "cell_type" : "markdown",
      "execution_count" : 2,
      "metadata" : {
        "cell.metadata.exec_info" : {
          "startTs" : 1572050607379,
          "endTs" : 1572050607398
        },
        "language" : "text"
      },
      "language" : "text",
      "source" : [
        "Next, in Python, we will convert the Scala DataFrame to a Pandas DataFrame through PySpark "
      ],
      "outputs" : [
        {
          "ename" : "java.lang.IllegalArgumentException",
          "evalue" : "No interpreter for text",
          "traceback" : [
          ],
          "output_type" : "error"
        }
      ]
    },
    {
      "cell_type" : "code",
      "execution_count" : 3,
      "metadata" : {
        "cell.metadata.exec_info" : {
          "startTs" : 1572051607601,
          "endTs" : 1572051620603
        },
        "language" : "python"
      },
      "language" : "python",
      "source" : [
        "from pyspark.sql import DataFrame\n",
        "pythonDF = DataFrame(scalaDF, sqlContext) # sqlContext is provided by Polynote\n",
        "pandaDF = pythonDF.toPandas()\n",
        "pandaDF.head()"
      ],
      "outputs" : [
        {
          "execution_count" : 3,
          "data" : {
            "text/html" : [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>_1</th>\n",
              "      <th>_2</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>1</td>\n",
              "      <td>2</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>2</td>\n",
              "      <td>4</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>3</td>\n",
              "      <td>6</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain" : [
              "   _1  _2\n",
              "0   1   2\n",
              "1   2   4\n",
              "2   3   6"
            ]
          },
          "metadata" : {
            "name" : "Out",
            "type" : "TypedPythonObject[DataFrame]"
          },
          "output_type" : "execute_result"
        }
      ]
    }
  ]
}